PSCI 3300.003 Political Science Research Methods
A. Jordan Nafa
University of North Texas
September 13th, 2022
Better understanding of correlation and causation
Ways of thinking about and formally expressing causal relationships
No class Thursday, I will be at a conference in Canada
Research Question Assignment is due Sunday, September 18th
Problem set two will be posted on Canvas by the end of the day today
\[ \definecolor{treat}{RGB}{27,208,213} \definecolor{outcome}{RGB}{98,252,107} \definecolor{baseconf}{RGB}{244,199,58} \definecolor{covariates}{RGB}{178,26,1} \definecolor{index}{RGB}{37,236,167} \definecolor{timeid}{RGB}{244,101,22} \definecolor{mu}{RGB}{71,119,239} \definecolor{sigma}{RGB}{219,58,7} \newcommand{normalcolor}{\color{white}} \newcommand{treat}[1]{\color{treat} #1 \normalcolor} \newcommand{resp}[1]{\color{outcome} #1 \normalcolor} \newcommand{sample}[1]{\color{baseconf} #1 \normalcolor} \newcommand{covar}[1]{\color{covariates} #1 \normalcolor} \newcommand{obs}[1]{\color{index} #1 \normalcolor} \newcommand{tim}[1]{\color{timeid} #1 \normalcolor} \newcommand{mean}[1]{\color{mu} #1 \normalcolor} \newcommand{vari}[1]{\color{sigma} #1 \normalcolor} \]
Correlation is the degree to which two or more features of the world tend to occur in tandem.
We’ll call these features of the world “variables”
If two variables \(\treat{X}\) and \(\resp{Y}\) tend to occur together or increase at the same rate, we would say they are positively correlated
If the occurrence of \(\treat{X}\) is unrelated to \(\resp{Y}\), we would say these two variables are uncorrelated
If when \(\treat{X}\) occurs we are less likely to observe \(\resp{Y}\), we would say these two variables are negatively correlated
Consider the case of the resource curse in comparative politics: an alleged negative correlation between dependence on oil production and democracy
| | Not a Major Oil Producer | Major Oil Producer |
|---|---|---|
| Democracy | 78 | 15 |
| Non-Democracy | 58 | 26 |
Probability that a country is a democracy, given whether or not it is a major oil producer
\(\Pr(\mathrm{Democracy} | \mathrm{No~Oil}) = \frac{78}{78+58} \approx 0.5735\)
\(\Pr(\mathrm{Democracy} | \mathrm{Oil}) = \frac{15}{15+26} \approx 0.3659\)
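These conditional probabilities can be checked directly from the table counts; a minimal sketch in R (the counts and labels here come from the table above):

```r
# Cross-tabulation of regime type by oil production, from the table above
oil_table <- matrix(
  c(78, 15,   # Democracy: not a major producer, major producer
    58, 26),  # Non-Democracy: not a major producer, major producer
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Democracy", "Non-Democracy"),
    c("No Oil", "Oil")
  )
)

# Pr(Democracy | No Oil) = 78 / (78 + 58)
pr_dem_no_oil <- oil_table["Democracy", "No Oil"] / sum(oil_table[, "No Oil"])

# Pr(Democracy | Oil) = 15 / (15 + 26)
pr_dem_oil <- oil_table["Democracy", "Oil"] / sum(oil_table[, "Oil"])

round(c(pr_dem_no_oil, pr_dem_oil), 4)
#> [1] 0.5735 0.3659
```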
Extending the table: each row's Pr(Oil) cell is the probability of being a major oil producer given that regime type, and the bottom row gives the probability of democracy given oil-producer status

| | Not a Major Oil Producer | Major Oil Producer | Pr(Oil) |
|---|---|---|---|
| Democracy | 78 | 15 | 0.1613 |
| Non-Democracy | 58 | 26 | 0.3095 |
| Pr(Democracy) | 0.5735 | 0.3659 | |
Description
Suppose we want to know whether countries where gender equality is higher are more democratic on average
We might be interested in the correlation between gender equality and democracy
If our data are good, we can answer this question with relatively few assumptions
We can estimate this correlation in R with a simple linear model and data from the Varieties of Democracy project’s {vdemdata} package
# Get the data we need from the vdemdata package
vdem_df <- vdemdata::vdem %>%
# We'll use just the year 2018 here for simplicity
filter(year == 2018) %>%
# Transmute a subset of the data for plotting
transmute(
country_name,
v2x_polyarchy = v2x_polyarchy*10,
v2x_gender = v2x_gender*10
)
# Estimate the linear relationship
lm_democ_gender <- lm(v2x_polyarchy ~ v2x_gender, data = vdem_df)
# Print a summary of the result
# Print a summary of the result
broom::tidy(lm_democ_gender)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    -3.03    0.486      -6.24 3.14e- 9
2 v2x_gender      1.12    0.0642     17.5  2.35e-40
Prediction and Forecasting
Suppose we want to predict which financial transactions are likely to be fraudulent
With large amounts of data on legitimate consumer transactions, we could develop a model that detects anomalies that are likely to be fraudulent
Major financial firms have entire teams dedicated to predicting and preventing fraud in this manner
Other examples of prediction include election forecasting, self-driving cars, and many other tasks
Causal Inference
Suppose we want to know if high school students would be more successful in life if they were forced to take calculus
The observed correlation between calculus and future success might be useful
But we would have to assume that the students taking calculus are otherwise the same as everyone else in terms of their underlying chances of success
Aside from very special circumstances, this kind of assumption will be hard to defend
As a general rule, correlation does not imply causation
For every variable we observe, we can compute a number of different statistics. Three that are particularly useful for understanding data are mean, variance, and standard deviation
Mean \(\mean{\mu}_{\treat{x}}\): \(\frac{\sum_{\obs{i}=\obs{1}}^{\sample{N}} \treat{x}_{\obs{i}}}{\sample{N}}\)
Variance \(\vari{\sigma}_{\treat{x}}^{2}\): \(\frac{\sum_{\obs{i}=\obs{1}}^{\sample{N}} (\treat{x}_{\obs{i}} - \mean{\mu}_{\treat{x}})^{2}}{\sample{N}}\)
Standard Deviation \(\vari{\sigma}_{\treat{x}}\): \(\sqrt{\vari{\sigma}_{\treat{x}}^{2}}\)
The mean and variance are known as the first and second moments of a variable’s distribution
Continuing with our previous example, we can calculate the mean, standard deviation, and variance of a numeric variable in R
[1] 7.355531
[1] 1.81225
[1] 3.284252
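The code for this slide does not survive in these notes; a self-contained sketch with simulated data (the vector `x` below is a hypothetical stand-in for whichever V-Dem index was used) shows the same three calls, plus the divide-by-N versus divide-by-(n - 1) wrinkle:

```r
set.seed(123)
# Hypothetical stand-in for the V-Dem variable used on the slide
x <- runif(100, min = 0, max = 10)

mean(x)  # First moment
var(x)   # Sample variance: divides by n - 1
sd(x)    # Sample standard deviation: sqrt(var(x))

# The formulas on the previous slide divide by N (population variance);
# R's var() and sd() divide by n - 1 instead
pop_var <- sum((x - mean(x))^2) / length(x)
all.equal(var(x), pop_var * length(x) / (length(x) - 1))
#> [1] TRUE
```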
These are univariate statistics, meaning they describe the distribution of a single variable
Now that we’ve learned some notation, and we can compute useful statistics for a single variable, we can start thinking about how to measure the correlation between two variables
One useful measure of correlation is the covariance
A second is the correlation coefficient
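In the notation introduced above, these two quantities can be written as:

\(Cov_{\treat{x}, \resp{y}} = \frac{\sum_{\obs{i}=\obs{1}}^{\sample{N}} (\treat{x}_{\obs{i}} - \mean{\mu}_{\treat{x}})(\resp{y}_{\obs{i}} - \mean{\mu}_{\resp{y}})}{\sample{N}}\)

\(\rho_{\treat{x}, \resp{y}} = \frac{Cov_{\treat{x}, \resp{y}}}{\vari{\sigma}_{\treat{x}} \vari{\sigma}_{\resp{y}}}\)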
We can calculate the covariance and correlation coefficient for two numeric variables in R
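The slide's code is not preserved here; presumably it called `cov()` and `cor()` on the V-Dem variables. A self-contained sketch with simulated stand-ins for the two indices:

```r
set.seed(42)
# Hypothetical stand-ins for the two V-Dem indices
x <- runif(150, 0, 10)            # e.g., gender equality
y <- 0.5 + 1.1 * x + rnorm(150)   # e.g., electoral democracy

cov(x, y)  # Covariance; note R divides by n - 1
cor(x, y)  # Pearson correlation coefficient, bounded in [-1, 1]

# cor() is just the covariance rescaled by the standard deviations
all.equal(cor(x, y), cov(x, y) / (sd(x) * sd(y)))
#> [1] TRUE
```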
We could calculate the correlation and covariance manually as well for illustrative purposes
# (x_{i} - mu_{x})
cov_x <- vdem_df$v2x_gender - mean(vdem_df$v2x_gender)
# (y_{i} - mu_{y})
cov_y <- vdem_df$v2x_polyarchy - mean(vdem_df$v2x_polyarchy)
# (x_{i} - mu_{x})(y_{i} - mu_{y})
cov_xy <- cov_x*cov_y
# Sum of (x_{i} - mu_{x})(y_{i} - mu_{y})
sum_cov_xy <- sum(cov_xy)
# Covariance of x and y, dividing by N
(cov_result <- sum_cov_xy/length(cov_xy))

[1] 3.661598
# sigma_{x} * sigma_{y}
sigma_xy <- sd(vdem_df$v2x_gender)*sd(vdem_df$v2x_polyarchy)
# Correlation
cov_result/sigma_xy

[1] 0.7910286
The correlation coefficient tells us about the tightness of the relationship between two variables.
But we often care more about the substantive magnitude of the relationship. How much does \(\resp{Y}\) vary as \(\treat{X}\) varies?
To answer this question, we want to know the slope of the regression line.
\(\beta = \frac{Cov_{\treat{x}, \resp{y}}}{\vari{\sigma}_{\treat{x}}^{2}}\)
On average, for every one-unit increase in \(\treat{X}\), \(\resp{Y}\) increases by…
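The identity on this slide is easy to verify in R: with simulated data, the slope from lm() matches cov(x, y) / var(x) exactly (the n - 1 denominators cancel):

```r
set.seed(1)
x <- rnorm(200)
y <- 2 + 3 * x + rnorm(200)

# Slope recovered two ways
beta_manual <- cov(x, y) / var(x)
beta_lm <- unname(coef(lm(y ~ x))["x"])

all.equal(beta_manual, beta_lm)
#> [1] TRUE
```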
An unfortunate fact of life, though one some social scientists all too often fail to appreciate, is that not everything is approximately linear and additive
In cases of non-linear relationships, calculating the linear correlation between two variables will give us the wrong answer
If you make stupid assumptions, you will get stupid results
Causation need not imply linear correlation
Lots of interesting relationships are non-linear and there are ways of analyzing these relationships
The easiest way to illustrate the problem is usually through simulation
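The simulation code for `nonlinear_sims` is not included in these notes; based on the plot titles below, it was presumably something like the following (the sample size and noise scale are assumptions):

```r
set.seed(2022)
n <- 500
x <- runif(n, min = -2, max = 2)

# Simulated relationships matching the three plot titles
nonlinear_sims <- data.frame(
  x         = x,
  y_posquad = x + x^2 + rnorm(n, sd = 0.5),      # Y = X + X^2 + e
  y_negquad = x - x^2 + rnorm(n, sd = 0.5),      # Y = X - X^2 + e
  y_sin     = sin(x * pi) + rnorm(n, sd = 0.25)  # Y = Sin(X * pi) + e
)
```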
Then we can use ggplot2 to make scatter plots of the relationships in the simulated variables
# Initiate the plot object
posquad_plot <- ggplot(nonlinear_sims, aes(x = x, y = y_posquad, fill = x)) +
# Add the data points
geom_point(shape = 21, size = 3) +
# Add the linear fit
geom_smooth(method = "lm", size = 2, se = FALSE, lty = 2, color = "white") +
# Tweak the fill color scheme
scale_fill_viridis_c() +
# Labels for the plot
labs(
x = "X",
y = "Y",
title = latex2exp::TeX(r'($Y_{i} = X_{i} + X_{i}^{2} + \epsilon_{i}$)')
) +
# Adjust the x axis scales
scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
# Adjust the y axis scales
scale_y_continuous(breaks = scales::pretty_breaks(n = 8))

# Initiate the plot object
negquad_plot <- ggplot(nonlinear_sims, aes(x = x, y = y_negquad, fill = x)) +
# Add the data points
geom_point(shape = 21, size = 3) +
# Add the linear fit
geom_smooth(method = "lm", size = 2, se = FALSE, lty = 2, color = "white") +
# Tweak the fill color scheme
scale_fill_viridis_c() +
# Labels for the plot
labs(
x = "X",
y = "Y",
title = latex2exp::TeX(r'($Y_{i} = X_{i} - X_{i}^{2} + \epsilon_{i}$)')
) +
# Adjust the x axis scales
scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
# Adjust the y axis scales
scale_y_continuous(breaks = scales::pretty_breaks(n = 8))

# Initiate the plot object
sin_plot <- ggplot(nonlinear_sims, aes(x = x, y = y_sin, fill = x)) +
# Add the data points
geom_point(shape = 21, size = 3) +
# Add the linear fit
geom_smooth(method = "lm", size = 2, se = FALSE, lty = 2, color = "white") +
# Tweak the fill color scheme
scale_fill_viridis_c() +
# Labels for the plot
labs(
x = "X",
y = "Y",
title = latex2exp::TeX(r'($Y_{i} = Sin(X_{i}\cdot \pi) + \epsilon_{i}$)')
) +
# Adjust the x axis scales
scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
# Adjust the y axis scales
scale_y_continuous(breaks = scales::pretty_breaks(n = 8))

All text and images in this course are made available for public non-commercial use under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.
All R, HTML, and CSS code is provided for public use under a BSD 3-Clause License.
The files and code necessary to reproduce the content of this course are or will be made available via the course's GitHub repository, with the exception of those covered by existing commercial copyright restrictions (i.e., copies of the assigned readings for the course).